Disclosure Control in Business Data - Experiences with Multiply Imputed Synthetic Datasets for the German IAB Establishment Survey
Abstract
Generating synthetic datasets based on the ideas of multiple imputation is an innovative method for statistical disclosure control. The basic idea is to replace the values of some confidential variables X with several draws from the posterior predictive distribution of X given some non-confidential variables Y. Since the synthetic values are based on models for the joint distribution of the data, many dependencies between the variables are preserved in the released data. Furthermore, the method can be applied to both discrete and continuous variables, and constraints such as non-negativity can be incorporated directly at the modeling stage. The approach is especially promising for business surveys, where the skewness of the data would require conventional disclosure control methods such as swapping or microaggregation to be applied at a very high level. The German Institute for Employment Research (IAB) is developing synthetic datasets for one of its establishment surveys, the IAB Establishment Panel. An actual release of a scientific use file based on synthetic datasets for the last wave of the Panel is planned for 2009. In this paper we discuss the challenges of implementing this approach for a large survey and give preliminary results on the applicability of these ideas to real-world datasets.
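To make the basic idea concrete, the following is a minimal sketch of how several synthetic copies of one confidential variable might be drawn. It is not the IAB's procedure: the normal linear model on the log scale, the noninformative prior, the function name `synthesize_positive`, and the simulated data in the usage note are all assumptions chosen for illustration. Modeling log(X) is one simple way to build a positivity constraint into the model itself.

```python
import numpy as np

def synthesize_positive(y, x, m=5, seed=0):
    """Draw m synthetic copies of a strictly positive variable x given y.

    Assumed model (illustrative only): log(x) = [1, y] @ beta + e,
    e ~ N(0, sigma^2), with the standard noninformative prior.
    """
    rng = np.random.default_rng(seed)
    log_x = np.log(x)                            # requires x > 0
    n = len(log_x)
    design = np.column_stack([np.ones(n), y])    # intercept plus predictors
    k = design.shape[1]

    # Least-squares quantities that define the posterior
    xtx_inv = np.linalg.inv(design.T @ design)
    beta_hat = xtx_inv @ design.T @ log_x
    resid = log_x - design @ beta_hat
    s2 = resid @ resid / (n - k)

    copies = []
    for _ in range(m):
        # sigma^2 | data: scaled inverse chi-square with n - k degrees of freedom
        sigma2 = (n - k) * s2 / rng.chisquare(n - k)
        # beta | sigma^2, data: normal around the least-squares estimate
        beta = rng.multivariate_normal(beta_hat, sigma2 * xtx_inv)
        # New values from the predictive distribution, back on the original scale
        noise = rng.normal(0.0, np.sqrt(sigma2), size=n)
        copies.append(np.exp(design @ beta + noise))
    return np.stack(copies)                      # shape (m, n)
```

A hypothetical call on simulated skewed data would be `synthesize_positive(y, x, m=5)`, where `y` holds the non-confidential predictors and `x` the confidential values. Because the parameters are redrawn for each copy, the m datasets differ from one another, which is what lets the released data reflect model uncertainty.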
Related resources
Synthetic Datasets for the German IAB Establishment Panel
Disseminating microdata to the public that provide a high level of data utility while at the same time guaranteeing the confidentiality of the survey respondent is a difficult task. Generating multiply imputed synthetic datasets is an innovative statistical disclosure limitation technique with the potential of enabling the data disseminating agency to achieve this twofold goal. So far, the appr...
Comparing Fully and Partially Synthetic Datasets for Statistical Disclosure Control in the German IAB Establishment Panel
For data sets considered for public release, statistical agencies have to face the dilemma of guaranteeing the confidentiality of survey respondents on the one hand and offering sufficiently detailed data for scientific use on the other hand. For that reason a variety of methods that address this problem can be found in the literature. In this paper we discuss the advantages and disadvantages o...
Joint NSF-Census-IRS Workshop on synthetic data and confidentiality protection
Many users of synthetic data, or any data altered to protect confidentiality, are understandably skeptical that analyses done on synthetic data will yield reasonable results. In this talk, I present recent research on methods to improve the analytic validity of synthetic data. Specifically, I talk about nonparametric methods of data synthesis that have the potential to capture complex distribut...
Editing and multiply imputing German establishment panel data to estimate stochastic production frontier models
This paper illustrates the effects of item-nonresponse in surveys on the results of multivariate statistical analysis when estimation of productivity is the task. To multiply impute the missing data a data augmentation algorithm based on a normal/Wishart model is applied. Data of the German IAB Establishment Panel from waves 2000 and 2001 are used to estimate the establishment’s productivity. T...
Likelihood Based Finite Sample Inference for Singly Imputed Synthetic Data Under the Multivariate Normal and Multiple Linear Regression Models
In this paper we develop likelihood-based finite sample inference based on singly imputed partially synthetic data, when the original data follow either a multivariate normal or a multiple linear regression model. We assume that the synthetic data are generated by using the plug-in sampling method, where unknown parameters in the data model are set equal to observed values of their point estima...